Introduction

We used the R package CoordinateCleaner to flag potentially erroneous, suspect, or imprecise geographic coordinates based on geographic gazetteers and metadata. It includes a series of tests that identify records assigned to country capitals, country and province centroids, urban areas, biodiversity institutions, or the GBIF headquarters. It also contains tests to flag coordinates below a given precision (e.g., 100 km), zero or equal coordinates, and duplicated records (i.e., identical taxon name and coordinates).

Note that we do not use the “seas” test to remove records in the ocean because such records were already removed in the pre-filter step of the workflow (more details here).


Important:

The result of each data quality test is appended to the database as a separate field containing TRUE or FALSE: TRUE indicates a correct record and FALSE a potentially problematic or suspect record.

Installation

You can install the released version of ‘BDC’ from GitHub with:

if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")

Creating folders to save the results
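
The later steps write to Output/Intermediate, Output/Report, and Output/Figures. A minimal base R sketch for creating these folders beforehand, assuming they should live under the current working directory; the helper name `create_output_dirs` is hypothetical and is not a bdc function:

```r
# Hypothetical helper (not part of bdc): create the output folders
# referenced later in this step of the workflow.
create_output_dirs <- function(root = ".") {
  dirs <- file.path(root, "Output", c("Intermediate", "Report", "Figures"))
  for (d in dirs) {
    # recursive = TRUE also creates the parent "Output" folder;
    # showWarnings = FALSE keeps the call quiet if a folder already exists
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(dirs)
}

create_output_dirs()
```

Calling the helper again is harmless because existing folders are left untouched.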

Read the database

Read the database created in the taxonomy step of the BDC workflow. It is also possible to read any dataset containing the fields required to run the workflow (more details here).

database <-
  qs::qread("Output/Intermediate/02_taxonomy_database.qs")

Standardization of character encoding.

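# Use [[i]] so that each column is extracted as a vector; with a tibble,
# database[, i] returns a one-column tibble and is.character() is never TRUE.
for (i in seq_len(ncol(database))) {
  if (is.character(database[[i]])) {
    Encoding(database[[i]]) <- "UTF-8"
  }
}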



Flag common spatial issues

check_space <-
  CoordinateCleaner::clean_coordinates(
    x =  database,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    species = "scientificName",
    countries = NULL,
    tests = c(
      "capitals",     # records within 2 km of country capitals
      "centroids",    # records within 1 km of country and province centroids
      "duplicates",   # duplicated records
      "equal",        # records with equal coordinates
      "gbif",         # records within 1 degree (~111 km) of GBIF headquarters
      "institutions", # records within 100 m of zoos and herbaria
      "outliers",     # geographic outliers
      "zeros",        # records with coordinates 0,0
      "urban"         # records within urban areas
    ),
    capitals_rad = 2000,
    centroids_rad = 1000,
    centroids_detail = "both", # test both country and province centroids
    inst_rad = 100, # flag records within 100 m of zoos and herbaria
    outliers_method = "quantile",
    outliers_mtp = 5,
    outliers_td = 1000,
    outliers_size = 10,
    range_rad = 0,
    zeros_rad = 0.5,
    capitals_ref = NULL,
    centroids_ref = NULL,
    country_ref = NULL,
    country_refcol = "countryCode",
    inst_ref = NULL,
    range_ref = NULL,
    # seas_ref = continent_border,
    # seas_scale = 110,
    urban_ref = NULL,
    value = "spatialvalid" # result of tests are appended in separate columns
  )
#> Testing coordinate validity
#> Flagged 0 records.
#> Testing equal lat/lon
#> Flagged 0 records.
#> Testing zero coordinates
#> Flagged 1 records.
#> Testing country capitals
#> Flagged 10 records.
#> Testing country centroids
#> Flagged 10 records.
#> Testing urban areas
#> Downloading urban areas via rnaturalearth
#> OGR data source with driver: ESRI Shapefile 
#> Source: "C:\Users\Bruno Ribeiro\AppData\Local\Temp\RtmpmWDtWH", layer: "ne_50m_urban_areas"
#> with 2143 features
#> It has 4 fields
#> Integer64 fields read as strings:  scalerank
#> Flagged 279 records.
#> Testing geographic outliers
#> Flagged 10 records.
#> Testing GBIF headquarters, flagging records around Copenhagen
#> Flagged 0 records.
#> Testing biodiversity institutions
#> Flagged 11 records.
#> Testing duplicates
#> Flagged 97 records.
#> Flagged 390 of 6112 records, EQ = 0.06.
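
Because every test writes its result to a separate TRUE/FALSE column, the number of flagged records per test can be tallied directly from the data frame. The sketch below uses a small invented data frame instead of the real `check_space` object; `count_flagged` is a hypothetical helper, and it assumes that test columns are the logical columns whose names start with "." (as `.cen` does further below).

```r
# Toy data standing in for the output of the spatial tests;
# the values are invented for illustration only.
toy_flags <- data.frame(
  scientificName = c("sp1", "sp2", "sp3", "sp4"),
  .cen = c(TRUE, FALSE, TRUE, TRUE),
  .zer = c(TRUE, TRUE, TRUE, FALSE)
)

# Hypothetical helper: count FALSE (flagged) values in every logical
# column whose name starts with "."
count_flagged <- function(df) {
  is_test <- startsWith(names(df), ".") & vapply(df, is.logical, logical(1))
  vapply(df[names(df)[is_test]], function(x) sum(!x, na.rm = TRUE), integer(1))
}

count_flagged(toy_flags)
```

The same call on `check_space` should give per-test counts comparable to the messages printed above, provided the column naming assumption holds.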

Flag coordinates with low decimal precision

Identification of records with a coordinate precision below a specified number of decimal places. For example, the precision of a coordinate with 1 decimal place is 11.132 km at the equator, i.e., the scale of a large city.

check_space <-
  bdc_coordinates_precision(
    data = check_space,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    ndec = c(0, 1) # number of decimals to be tested
  )
#> bdc_coordinates_precision:
#> Flagged 50 records
#> One column was added to the database.
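
The 11.132 km figure follows from one degree spanning roughly 111.32 km at the equator, so each extra decimal place divides the distance by ten. The sketch below reproduces that arithmetic and counts decimal places in a few invented coordinates. Both helpers are hypothetical illustrations, not the implementation used by `bdc_coordinates_precision`, and `count_decimals` assumes coordinates print in plain (non-scientific) notation.

```r
# Approximate ground distance (km, at the equator) represented by the
# last decimal place of a coordinate; 111.32 km per degree is assumed.
precision_km <- function(ndec, km_per_degree = 111.32) {
  km_per_degree / 10^ndec
}

precision_km(0:2)

# Hypothetical helper: number of digits after the decimal point,
# based on the default character representation of each value.
count_decimals <- function(x) {
  txt <- as.character(x)
  has_dec <- grepl(".", txt, fixed = TRUE)
  n <- integer(length(txt))
  n[has_dec] <- nchar(sub("^[^.]*\\.", "", txt[has_dec]))
  n
}

lon <- c(-45.1, -45.123, 12)   # invented longitudes
count_decimals(lon)

# TRUE where a coordinate has 0 or 1 decimal places, i.e. low precision
count_decimals(lon) %in% c(0, 1)
```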

Mapping spatial errors

It is possible to map a column containing the results of one spatial test. For example, let’s map records located at country or province centroids.

check_space %>%
  dplyr::filter(.cen == FALSE) %>%
  bdc_quickmap(
    data = .,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    col_to_map = ".cen",
    size = 0.7
  )


Coordinates in country and province centroids

Report

Creating a column named “.summary” summarizing the results of all tests. For a given record, this column is “FALSE” if any test returned “FALSE” (i.e., a potentially invalid or suspect record).

check_space <- bdc_summary_col(data = check_space)
#> Column '.summary' already exist. It will be updated
#> 
#> bdc_summary_col:
#> Flagged 467 records.
#> One column was added to the database.
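
The logic of the summary column can be sketched in base R as a row-wise check across the test columns. This is an illustration on invented data, not the code behind `bdc_summary_col`; `add_summary` is a hypothetical helper, and it assumes that all test columns are logical and have names starting with ".".

```r
# Toy data: three test columns with invented results
toy_tests <- data.frame(
  id   = 1:4,
  .cen = c(TRUE, FALSE, TRUE, TRUE),
  .zer = c(TRUE, TRUE, TRUE, FALSE),
  .urb = c(TRUE, TRUE, TRUE, TRUE)
)

# Hypothetical helper: a record passes (.summary is TRUE) only if
# no test column is FALSE for that record.
add_summary <- function(df) {
  # Exclude an existing ".summary" column so it is recomputed, not reused
  test_cols <- setdiff(names(df)[startsWith(names(df), ".")], ".summary")
  flags <- as.matrix(df[test_cols])
  df$.summary <- unname(rowSums(!flags) == 0)
  df
}

toy_tests <- add_summary(toy_tests)
toy_tests$.summary
```

Excluding an existing `.summary` column before recomputing mirrors the "already exist. It will be updated" message shown above.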



Creating a report summarizing the results of all tests.

report <-
  bdc_create_report(data = check_space,
                    database_id = "database_id",
                    workflow_step = "space")
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the space in:
#> Output/Report

report


Figures

Creating figures (bar plots and maps) to facilitate the interpretation of the results of data quality tests.

bdc_create_figures(data = check_space,
                   database_id = "database_id",
                   workflow_step = "space")
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures


Rounded coordinates (% of each database flagged)


Records within biodiversity institutions


Summary of all tests


Filter the database

It is possible to remove flagged records (potentially problematic ones) to get a ‘clean’ database (i.e., without test columns starting with “.”). However, to ensure that all records are evaluated in all the data quality tests (i.e., tests of the taxonomic, spatial, and temporal steps of the workflow), potentially erroneous or suspect records will be removed only in the final step of the workflow.

# output <-
#   check_space %>%
#   dplyr::filter(.summary == TRUE) %>%
#   bdc_filter_out_flags(data = ., col_to_remove = "all")

Save the database

check_space %>%
  qs::qsave(.,
            here::here("Output", "Intermediate", "03_space_database.qs"))